impossible task


ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

Zhong, Ziqian, Raghunathan, Aditi, Carlini, Nicholas

arXiv.org Artificial Intelligence

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.


BSBench: will your LLM find the largest prime number?

Erziev, K. O. T.

arXiv.org Artificial Intelligence

With large language models (LLMs) continuing to achieve high scores on various benchmarks [Ope25; Dee+25; Ant25; Tea+25], the question remains of how well these scores translate into real-world performance. In the real world, there are often questions with no solutions, because problems are underdetermined, overdetermined, or simply ill-posed. The ability to ask the right questions (and filter out the fluff before answering them) is arguably no less valuable than the ability to answer questions with given answers. This stands in stark contrast with the current approach to benchmark evaluation (and training [Dee+25; Lam+25]), in which benchmarks are supposed to be crafted carefully enough to have at least a single unambiguous solution. We propose that models should be systematically tested for the existence of such a "bias", which, if present, might translate into models always trying to find a solution even when the right thing to do is to say that the question is ill-posed, and in turn sabotage the potential for success of the (semi-)autonomy envisioned for agents built upon these models.


D-Wave says its quantum computers can solve otherwise impossible tasks

New Scientist

Quantum computers can now solve problems with real-world applications faster than any ordinary computer, suggesting they could be commercially viable, say researchers at quantum computing firm D-Wave – though outside observers are more cautious. It had long been hoped that quantum computers would be able to perform some tasks that are impractical or impossible on even the best supercomputers. Google was the first to demonstrate this "quantum supremacy" in 2019, but only for a somewhat contrived benchmark test with no practical use.


The 40 Best Movies on Netflix This Week

WIRED

Netflix has plenty of movies to watch, but it's a real mixed bag. Sometimes finding the right film at the right time can seem like an impossible task. Fret not, we're here to help. Below is a list of some of our favorite films currently on the streaming service--from dramas to comedies to thrillers. If you decide you're in more of a TV mood, head over to our collection of the best TV series on Netflix. Check out our lists of the best sci-fi movies, best movies on Amazon Prime, and the best flicks on Disney+. It's easy to imagine that the elevator pitch for The Sea Beast was "Moby Dick meets How to Train Your Dragon"--and who wouldn't be compelled by that? Set in a fantasy world where oceanic leviathans terrorize humanity, those who hunt down the giant monsters are lauded as heroes.


Artificial Intelligence vs. Natural Stupidity - spxbot blog

#artificialintelligence

I imagine that many readers interested in the topic of Artificial Intelligence have become used to a very romantic view of the subject. The majority of the articles you may read present anxious questions about a technocratic future generated and managed by AI-driven machines. That is not the whole picture, however; it is only the romantic side of the story. Many of us will surely lose our jobs to "intelligent" machines, and quite soon. The production/consumption paradigm is rapidly changing as society adopts growing levels of robotics.


OpenAI wants to make safe AI, but that may be an impossible task.

#artificialintelligence

True artificial intelligence is on its way, and we aren't ready for it. Just as our forefathers had trouble visualizing everything from the modern car to the birth of the computer, it's difficult for most people to imagine how much truly intelligent technology could change our lives as soon as the next decade -- and how much we stand to lose if AI goes out of our control. Fortunately, there's a league of individuals working to ensure that the birth of artificial intelligence isn't the death of humanity. From Max Tegmark's Future of Life Institute to the Harvard Kennedy School of Government's Future Society, the world's most renowned experts are joining forces to tackle one of the most disruptive technological advancements (and greatest threats) humanity will ever face. Perhaps the most famous organization to be born from this existential threat is OpenAI.


Eye of the beholder: the impossible task of making creative A.I.

#artificialintelligence

When Hello Games, an independent gaming studio in sleepy Surrey, announced to the world that they were building a space exploration video game that would feature more than 18 quintillion unique planets to explore, the gaming press exploded with hype. But when the game was eventually released, it was met with a tsunami of disappointment and criticism. The Grid, an A.I. startup that promised to deliver a fully automated website builder (and a respected rival of my own business, Firedrop), fell into a similar trap as its heady balloon of voracious hype eventually burst and the end product failed to match consumer expectations. Neither product was bad, however (nor are they dead; both have done quite well commercially as far as we know). They were both victims of excessive hype which, as we all know, is a predictable prelude to mass disappointment.